(54) System and method for voiced interface with hyperlinked information



(57) An improved audio browser is disclosed. In an
exemplary embodiment, a plurality of hypertext links
(hereafter called "hyperlink words") available from, for
example, a World Wide Web document, are used as a
vocabulary of a speech recognizer for an audio
browser. These hyperlink words are read to the user in
the ordinary course of the audio browser's "speaking
voice" - such hyperlink words being identified to the
user by, for example, a change in voice characteristics
for the "speaking voice." When a user wishes to select a
hyperlink word, the user merely repeats the hyperlink
word itself, rather than speaking a command or using a
DTMF tone. The speech recognizer, which has as its
vocabulary some or all of the hyperlink words of the
document, recognizes the spoken hyperlink word and
causes the jump to the linked address associated with
the recognized hyperlink word.

Description

FIELD OF THE INVENTION

This application is related to the art of user interaction
with stored information, and more particularly, to
such an interaction via spoken dialogue.

INTRODUCTION TO THE INVENTION 

Software programs, known as "browsers," are popularly
used for providing easy access to that portion of
the Internet known as the World Wide Web (WWW).
Examples of such browsers include the Netscape Navigator,
available from Netscape Communications, Inc.,
and the Internet Explorer, available from Microsoft
Corporation. These browsers are textual and graphical user
interfaces which aid a computer user in requesting and
displaying information from the WWW. Information
displayed by a browser includes documents (or "pages")
which comprise images, text, sound, graphics and
hyperlinks, often referred to as "hypertext."

Hypertext is a graphical representation, in text form,
of another address (typically of another document)
where information may be found. Such information
usually relates to the information content conveyed by the
"text." The hypertext is not usually the address itself, but
text conveying some information which may be of interest
to the user. When a user selects a piece of hypertext
(for example, by a mouse "click"), the browser will
typically request another document from a server based on
an address associated with the hypertext. In this sense,
the hypertext is a link to the document at the associated
address.

In addition to the conventional computer software
browsers, other types of browsers are known. Audio
browsers approximate the functionality of computer
browsers by "reading" WWW document text to a user
(listener). Audio browsers are particularly useful for
persons who are visually impaired or persons who cannot
access a computer but can access a telephone. Reading
of text is accomplished by conventional text-to-speech
(TTS) technology or by playing back prerecorded
sound. Hypertext is indicated to the listener by
audible delimiters, such as a "beep" before and after the
hypertext, or by a change of voice characteristics when
hypertext is spoken to the listener. When a listener
wishes to jump to the linked address associated with the
hypertext, the listener replies with either a DTMF tone
(i.e., a touch-tone) or speaks a command word such as
"jump" or "link," which is recognized by an automatic
speech recognition system. In either case, the audio
browser interprets the reply as a command to retrieve
the document at the address associated with the
hypertext link just read to the listener.



SUMMARY OF INVENTION

The present invention is directed at an improved
audio browser. The inventor of the present invention has
recognized that conventional audio browsers have a
limitation which has to do with the use of simple command
words or tones to select a hyperlink. In particular,
the inventor has recognized that because the same
command or tone is used to indicate a desire to jump to
any hypertext-linked address, a conventional audio
browser forces a listener (user) to select a given
hypertext link before the listener is presented with the next
hypertext link. Since hypertext links may be presented
in rapid succession, or because a user may not know
which hyperlink to select until the user hears additional
hyperlinks, users of such audio browsers must use
rewind and play commands to facilitate the selection of
hypertext which was read but not selected prior to the
reading of the next piece of hypertext.

The inventor of the present invention has further
recognized that features of a speech recognition technique
employed in computer browsers for sighted persons
are useful in improving browsers meant for
persons who cannot see a computer screen. See, e.g.,
U.S. Patent Application Serial No. 08/460,955, filed on
June 5, 1995, which is hereby incorporated by reference
as if fully disclosed herein.

In accordance with an embodiment of the present
invention, a plurality of hypertext links (or, somewhat
more descriptively, "hyperlink words") available from, for
example, a WWW document, are used as a vocabulary
of a speech recognizer for an audio browser. These
hyperlink words are read to the user in the ordinary
course of the audio browser's "speaking voice" - such
hyperlink words being identified to the user by, for
example, a change in voice characteristics for the "speaking
voice." When a user wishes to select a hyperlink word,
the user merely repeats the hyperlink word itself, rather
than speaking a command or using a DTMF tone, as
with prior art audio browsers. The speech recognizer,
which has as its vocabulary some or all of the hyperlink
words of the document, recognizes the spoken hyperlink
word and causes the jump to the linked address
associated with the recognized hyperlink word.


BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 provides a schematic depiction of a prior art
information access system.

FIG. 2 provides a schematic depiction of the voiced
information access system of the invention.

FIG. 3 provides a more detailed view of some of the
functions shown schematically in Figure 2.

FIG. 4 provides a schematic depiction of an embodiment
of the system of the invention where information is
provided as prerecorded voice or other audio content.

DETAILED DESCRIPTION

In the contemporary environment, an interface
between a user and some information of interest to that
user via an electronic medium has become almost
ubiquitous. A typical illustration of such an interface is shown
in Figure 1, where a user, situated at User's Audio (e.g.,
telephonic) Terminal 101, obtains access via a
communications path, illustratively depicted as a Public
Switched Telephone Network (PSTN) 110, to an Audio
Serving Node 120, in which Audio Server 122 provides
an interface for the user to information stored in an
associated database (Data Storage 121).

As also shown in the figure, a user might also
obtain access to desired information from a text or
graphics-based medium, such as User's Data (e.g.,
computer) Terminal 102. The user obtains access via a
communications path, illustratively depicted as PSTN
110, to a Data Serving Node 130, in which Data Server
132 provides an interface for the user to information
stored in an associated database (Data Storage 131).

While it is known to provide access from such a text
or a graphics-based interface device to highly complex
and multi-layered information sources, the voice-based
interfaces known in the prior art are able to provide
access to only a highly limited scope of such information,
as described hereinbefore.

It is, however, well known in the art to provide
text-based information (including transactional options)
arranged either in linked layers of increasing (or
decreasing) complexity and/or detail, or in a network of
links designating logical relationships. Where information
is arranged in hierarchical layers, linkages between
such layers are typically established on the basis of key
words or phrases deployed in a particular layer, where
each such key word provides a linkage to related
information, typically in another layer. While the discussion
herein is focused on access to information stored in
hierarchical layers, it should be noted that this usage is
exemplary and is not intended to limit the scope of the
invention. In fact, the invention pertains to all types of
logical linkages.

A highly-used case of such a text-based set of
hierarchical linked information layers is found in the method
known as HyperText Markup Language, or HTML.
HTML provides important functionality for the World
Wide Web. With the WWW, an initial layer, or "home
page", is presented to a user, with that home page
typically offering a comparatively high level description of
information related to the subject matter or application
associated with that Web site. For a user wishing to
pursue more detail, or particular transactions, related to
that home page information, key words or phrases are
highlighted in the home page text, which are linked to
such greater detail and/or specific transactions - such
links being provided by the HTML functionality.

In a typical HTML application, a page of text would
be displayed to a user on a monitor associated with a
personal computer (the initial such page typically called
the home page), with hypertext (or hyperlink words) in
that text displayed in a particular color and underlined,
or in some other way differentiated from the typeface
associated with the regular text. A user wishing to
access the underlying (or related) information for such a
hyperlink word would locate the hypertext with a mouse
pointer or cursor, and signal an intent to access the
underlying information by either clicking a mouse button
or pressing the "enter" key on a keyboard.

I. Introduction To An Illustrative Process In Accordance
With The Invention.

In accordance with an illustrative embodiment, a
voiced user interface to a layered set of interlinked
information is provided through an initial establishment of
the desired information database as a text-based set of
linked HTML layers (hereafter sometimes called HTML
"pages"). These pages may be stored at a single server
or at a plurality of networked servers. In accordance
with an embodiment of the invention, the text of a given
HTML page is then caused to be translated to a voiced
form, where hyperlink words in that text are rendered in
a distinctive voicing from that of other text. The user
interacts with this voiced information system by repeating
(i.e., voicing) a hyperlink word representing a point
where additional, related information is desired, and an
automatic speech recognition system recognizes an
utterance of a given hyperlink word by the user. Upon
such recognition of the given hyperlink word, a jump is
made to the information layer corresponding to that
given hyperlink word and thereafter the text of the new
information layer is caused to be translated to a voiced
form.
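
As a rough illustration of the first step described above, separating hyperlink words from ordinary text on an HTML page, a minimal Python sketch might look like the following. It relies only on the standard-library html.parser module; the class name HyperlinkWordParser, the function split_page, and the sample page are hypothetical and are not taken from the disclosure.

```python
from html.parser import HTMLParser


class HyperlinkWordParser(HTMLParser):
    """Collects (text, href) pairs; href is None for ordinary text."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._href = None  # href of the <a> element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.segments.append((text, self._href))


def split_page(html):
    """Return the page as an ordered list of (text, href-or-None) segments."""
    parser = HyperlinkWordParser()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    page = '<p>See our <a href="/rates.html">mortgage rates</a> today.</p>'
    for text, href in split_page(page):
        print("LINK" if href else "TEXT", repr(text), href)
```

Segments carrying an address would be the candidate "hyperlink words" for the recognizer vocabulary; the remaining segments are ordinary text to be read aloud.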

In accordance with the embodiment of the invention,
the text of an HTML page is converted to a voiced
form. That voiced HTML text will then be played to a
user via any of numerous well known communication
links, including, in the preferred embodiment, a
telephonic link. Such a translation of text to voiced form is
very well known and typically would be carried out by a
text-to-speech synthesizer (TTS). Such TTS systems
are themselves well known. Exemplary such TTS systems
are described in U.S. Patents Nos. 4,685,135;
5,157,759; and 5,204,905.

Because a user interfacing with the voiced information
service of the embodiment will indicate an interest
in exploring another layer of the linked information by a
response directed to the hyperlink word related to the
additional information, it is desirable that the voiced
information provide an aural distinction between a
hyperlink word and other voiced text. There are various
known methods in the TTS art for creating voicing
distinction as to different portions of a synthesized text.
One exemplary such method, which represents an
illustrative embodiment of the invention, is to cause the
ordinary text to be provided in a male voice and the
hyperlink word to be rendered in a female voice, or vice
versa. The changing of voices in the TTS art is a well
known process.
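
The voice-switching idea can be sketched as a simple playback plan: each segment of the page is tagged with the voice in which it should be rendered. The function name plan_voices and the voice labels are assumptions made for illustration; a real TTS engine would expose its own way of selecting a voice.

```python
def plan_voices(segments, text_voice="male", link_voice="female"):
    """Tag each (text, is_hyperlink) segment with the voice used to speak it.

    The labels are placeholders; any two audibly distinct voices would do.
    """
    return [
        (link_voice if is_hyperlink else text_voice, text)
        for text, is_hyperlink in segments
    ]


if __name__ == "__main__":
    page = [("See our", False), ("mortgage rates", True), ("today.", False)]
    for voice, text in plan_voices(page):
        print(f"[{voice}] {text}")
```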

As a user is listening to the voiced text for a given
layer of information, and hears a hyperlink word, the
user has two choices. He can continue listening to the
enunciated text (corresponding to a continued reading
of an HTML page). Alternatively, if the hyperlink word
prompts a desire to pursue more detailed information
related to that hyperlink word, he can indicate, by
repeating the word, his selection of the word. That
voiced user response will be conveyed to a speech
recognizer associated with the information system via a
communications link, which may be the same
communications link as used for providing the enunciated
information text to the user. Such speech recognizers are
also well known in the art.

The function of the speech recognizer in the system
of the invention is to recognize the voiced response of
the user as either one of the hyperlink words in an
information layer under consideration, or one of a small
number of reserved "action" words (e.g., commands)
which are established to cause the system to take
certain actions. Thus the hyperlink words, along with the
action words, serve as a portion of the vocabulary of the
speech recognizer. The action words, which are
reserved and therefore cannot be used as hyperlink
words, are of the sort: "stop", "back", "start", "slower",
"faster", etc., and generally would be established by the
system operator. It is of course preferable that the set of
action words be small, and that the same set be
maintained in common across all applications of the model.
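
A minimal sketch of how a recognized utterance might be sorted into "action word" versus "hyperlink word" is shown below. The action set used here is only an illustrative subset, and the names ACTION_WORDS, build_vocabulary and classify are hypothetical. The check in build_vocabulary reflects the statement above that reserved action words cannot double as hyperlink words.

```python
# Illustrative subset of reserved action words; a system operator would choose the real set.
ACTION_WORDS = frozenset({"stop", "back", "slower", "faster"})


def build_vocabulary(hyperlink_words, action_words=ACTION_WORDS):
    """Normalize hyperlink words and reject any that collide with action words."""
    links = {w.strip().lower() for w in hyperlink_words}
    clash = links & action_words
    if clash:
        raise ValueError(f"reserved action words used as hyperlinks: {sorted(clash)}")
    return links, action_words


def classify(utterance, hyperlink_words, action_words=ACTION_WORDS):
    """Return ('action' | 'hyperlink' | 'unrecognized', normalized utterance)."""
    links, actions = build_vocabulary(hyperlink_words, action_words)
    word = utterance.strip().lower()
    if word in actions:
        return ("action", word)
    if word in links:
        return ("hyperlink", word)
    return ("unrecognized", word)


if __name__ == "__main__":
    links = ["mortgage rates", "branch locations"]
    print(classify("Stop", links))
    print(classify("mortgage rates", links))
    print(classify("weather", links))
```

Here the "utterance" is already text; in the described system that text would be the output of the speech recognizer rather than typed input.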

The speech recognition function for the system of
the invention is particularly easy to implement because
the speech recognizer generally needs only be able to
recognize a small vocabulary of words at any given
point in time - the vocabulary of hyperlink words and
action words. To aid recognizer performance, a sliding
window of hyperlink words may be used to define the
recognizer vocabulary, so that, at any given point in
time, that vocabulary would include the most recently
played hyperlink word and some number of hyperlink
words enunciated earlier (but, in general, less than the
total of all previously played links). Accordingly, by using
a sliding window (which tracks the enunciator) for the
speech recognizer vocabulary, comprising a given
hyperlink word and the additional words within some
interval (which may include additional hyperlink words),
the word recognizer need only be able to recognize
hyperlink words appearing in that interval (plus the
system action words). Moreover, because the TTS system
which provides the enunciation of those hyperlink words
is part of the same system as the word recognizer, the
word recognizer and the TTS system are able to share
certain speech data, such as phonemic sequences of
hyperlink words, which helps to keep the TTS system
and the recognizer's "window" synchronized.
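
The sliding-window vocabulary can be sketched as a small time-stamped queue. This is only one possible reading of the passage: the window here is measured in seconds, and the class name SlidingVocabulary and its methods are hypothetical.

```python
from collections import deque


class SlidingVocabulary:
    """Keeps the hyperlink words played within the last `window_seconds`."""

    def __init__(self, window_seconds, action_words):
        self.window_seconds = window_seconds
        self.action_words = set(action_words)
        self._played = deque()  # (time_played, word), oldest first

    def on_link_played(self, word, time_played):
        """Called when the enunciator finishes speaking a hyperlink word."""
        self._played.append((time_played, word))

    def active(self, now):
        """Vocabulary at time `now`: recent hyperlink words plus action words."""
        while self._played and now - self._played[0][0] > self.window_seconds:
            self._played.popleft()  # drop links that have slid out of the window
        return {word for _, word in self._played} | self.action_words


if __name__ == "__main__":
    vocab = SlidingVocabulary(window_seconds=10.0, action_words={"stop"})
    vocab.on_link_played("rates", 0.0)
    vocab.on_link_played("loans", 8.0)
    print(sorted(vocab.active(9.0)))
    print(sorted(vocab.active(15.0)))
```

A word-count window would work the same way, with a position counter in place of the timestamp.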

Upon recognition by the word recognizer of a hyperlink
word spoken by the user, a signal is then generated
indicating that a particular hyperlink word has been
selected by the user. Using methodologies analogous to
those used in a purely text-based hypertext system, this
recognition of a particular hyperlink word operates to
cause the system to jump to the information layer linked
to that hyperlink word. When that linked layer is
reached, the text in that layer is similarly translated to a
voice form for communication to the user, and will be
subject to further user response as to the selection of
hyperlink words or system action words within that new
layer. As with existing text-based technologies such as
the World Wide Web, one or more of the linked
information layers may well reside in storage media associated
with servers operating at other locations, where that link
is established via a communications path between a
first server and the linked server.

Note also that at any layer, part or all of the
information may be prerecorded human voice and stored
audio information, such as provided over the World
Wide Web by streaming audio - e.g., RealAudio™ from
Progressive Networks, Inc. In this case, hyperlink words
may be distinguished by recording such hyperlink words
in a voice of opposite gender from that used for other
text.

II. Implementation of the Illustrative Process

In Figure 2, a system for implementing the method
of the invention is depicted. Referring to that figure, a
set of HTML pages representing an information
database of interest will be provided in Data Storage 202,
which, along with associated HTML Server 203,
comprise Primary Serving Node 201-P. Note, however, that
sub-layers or related portions of the information set may
be stored at Remote Serving Nodes 201-R1 - 201-Rm,
each such Remote Serving Node including an HTML
Server and an associated Data Storage means. Each
Remote Serving Node will in turn be linked to Voice
Serving Node 215, and other serving nodes via a Data
Network 205 - e.g., the Internet.

In response to a request for access to that data set
(e.g., through the arrival of a phone call from User's
Audio Terminal 101 through PSTN 110), the Automatic
Call Distributor 225 in Voice Serving Node 215 assigns
an available Voice Serving Unit 216-1 to the service
request. In the assigned Voice Serving Unit, the HTML
Client 250 will cause the first of the HTML pages (the
"home page") to be called up from Primary Serving
Node 201 for further processing by the assigned Voice
Serving Unit. (Primary Serving Node 201 may be
collocated with Voice Serving Node 215.) The HTML home
page (supplied by HTML Server 203 from Data Storage
202 in Primary Serving Node 201 to HTML Client 250 in
Voice Serving Unit 216-1) will then typically be
translated to a voice form by Voiced Translation means 210,
which will typically be realized with a TTS system. Note
that the voiced form of some or all HTML "pages" may
have been obtained and stored prior to the user's
access/request, not necessarily immediately following
that access/request. Caching techniques, well known in
the art, may determine which voice forms will be
pre-stored, and which generated in response to a user
request.

The voiced text from the HTML home page will then
be transmitted over communications link 211 to the
Barge-In Filter 220, from which it can be heard by the
user through User's Audio Terminal 101. As the user
listens to the HTML page being enunciated by the Voiced
Translation means, he may hear a hyperlink word for
which he wishes to obtain additional or related detail (or
to trigger a transaction as described below); to indicate
this desire for such additional or related detail, he will
repeat (speak) the hyperlink word through User's Audio
Terminal 101. That voiced response from the user is
processed through Barge-In Filter 220 and transmitted
to Speech Recognizer 240 over communications link
221.

An important function of Barge-In Filter 220 is to
ensure that only the words uttered by the user (excluding
the words enunciated by the Voiced Translation
means) are inputted to Speech Recognizer 240. Such
Barge-In Filters are known in the art and operate by
subtracting electrical signals generated from a known
source (Voiced Translation means) from the total mix of
that known source and user-uttered words; for the
purposes of this disclosure, the Barge-In Filter is also
understood to operate as an echo canceler, compensating
for the imperfections in the transmission path
between the user and the Voice Serving Unit.
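
The subtraction performed by the barge-in filter can be shown with a deliberately idealized sketch: the known playback signal is scaled and subtracted, sample by sample, from the mixed input. Practical echo cancelers are adaptive and must estimate delay and path response; none of that is modeled here, and the function name remove_playback and the gain parameter are assumptions for illustration.

```python
def remove_playback(mixed, playback, gain=1.0):
    """Subtract the known playback signal from the mixed input, sample by sample.

    Assumes both signals are already time-aligned and equally long, which a
    real system would have to arrange (e.g., via adaptive echo cancellation).
    """
    if len(mixed) != len(playback):
        raise ValueError("signals must have the same length")
    return [m - gain * p for m, p in zip(mixed, playback)]


if __name__ == "__main__":
    playback = [0.5, -0.25, 0.0]   # what the system is speaking
    user = [0.0, 0.125, 0.25]      # what the user is saying
    mixed = [p + u for p, u in zip(playback, user)]
    print(remove_playback(mixed, playback))
```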

Speech Recognizer 240 synchronizes its recognition
vocabulary (with hyperlink words that may be
uttered by the user over time) through Communications
Link 222 from Voiced Translation means 210. Upon
recognition of a selected hyperlink word by the Speech
Recognizer, a signal related to that word is sent from the
Recognizer to the HTML Client 250 which converts that
signal into an appropriate code for the HTML Server as
indicative that a hyperlink should be established to the
information layer/location linked to the selected
hyperlink word - this action is analogous to a user clicking a
mouse with the cursor pointed at the hyperlink word and
the system response thereto.
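
One way to picture the HTML Client's role is as a lookup from the recognized hyperlink word to its linked address, followed by resolution against the current page's address. The sketch below uses the standard-library urllib.parse.urljoin; the function name resolve_selection, the link table, and the example URLs are all hypothetical.

```python
from urllib.parse import urljoin


def resolve_selection(word, link_table, base_url):
    """Map a recognized hyperlink word to the absolute address to request.

    link_table maps normalized hyperlink words to the href found on the page.
    Returns None when the word is not a hyperlink on the current page.
    """
    href = link_table.get(word.strip().lower())
    if href is None:
        return None
    return urljoin(base_url, href)  # resolve relative hrefs against the page


if __name__ == "__main__":
    table = {"mortgage rates": "rates.html"}
    base = "http://example.com/bank/home.html"
    print(resolve_selection("Mortgage Rates", table, base))
```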

Figure 3 presents a more detailed view of some of
the salient functions presented in Figure 2. In particular,
Figure 3 presents the functions which perform the TTS
process of Voiced Translation 210 (which includes
conventional Text-To-Phoneme Translation processor 315
and Phoneme-To-Sound Conversion processor 317),
the Hypertext Identification processor 310, which
operates on a stream of text available from an HTML
document page, a Hypertext-to-Phoneme Correlator 320, for
correlation of identified hypertext with phoneme strings,
and a Window Filter 330, which determines which of the
identified sequences of hypertext text should be used by
a Speech Recognition processor 350 as part of the
vocabulary for the recognizer system.

In accordance with the described embodiment, a
given HTML document page (for aural presentation to a
system user) is retrieved by HTML Client 250 from
Primary Serving Node 201 and made available for further
processing. The given HTML document page is
analyzed by Hypertext Identification processor 310 to
identify the hypertext on the page. An output from Hypertext
Identification processor 310 is provided to
Hypertext-to-Phoneme Correlator 320, and a signal derived from that
output is provided to Phoneme-To-Sound Conversion
processor 317, in order to facilitate differential voicing
between the hyperlink words and other text in the HTML
page.

The text on the document page is also provided to
Voiced Translation (TTS) system 210 for conversion to
speech. This is accomplished through a conventional
two-step process of translating text to sequences of
phonemes by Text-To-Phoneme Translation processor
315 and a phoneme to sound conversion by
Phoneme-To-Sound Conversion processor 317.

Correlated hypertext and phoneme sequences are
presented to a Window Filter 330 which identifies which
of the hyperlink words/phrases that have been played to
the user up to a given time will form the vocabulary of
the speech recognizer (along with the system action
words). This Window Filter 330 will select the most
recently played hypertext and all preceding hypertext
within a certain duration in the past (which could be
measured in, for example, seconds or words). The
Window Filter 330 receives synchronization information
concerning the words most recently played to the user
from Phoneme-To-Sound processor 317 via
Communications Link 318. The results of the window filter
process - i.e., the sequence of hyperlink words/phrases
occurring within the duration of a given window - are
stored in a Database 340 along with phoneme models
of such speech (typically implemented as independently
trained hidden Markov models (HMMs)). Database 340
will, of course, also contain phoneme models of the
system action words. A conventional Automatic Speech
Recognition processor 350 receives unknown speech
from the user (via Barge-In Filter 220 and
Communications Link 221) and operates to recognize the speech as
one of the current vocabulary of hyperlink words or a
system action word. The Speech Recognition processor
350 interacts with Database 340 to do conventional -
e.g., Viterbi - scoring of the unknown speech with the
various models in the database. Upon recognition of a
hyperlink word/phrase or a system action word, an
output of the recognizer system is provided to Primary
Serving Node 201 for action appropriate to the selected
hyperlink word (e.g., retrieval of the commensurate
HTML "page") or the system action word.

Window Filter 330 may be flat-weighted, admitting
all hyperlink words enunciated in the predefined
time-window into the vocabulary of the Speech Recognizer
with equal probability; alternatively the Window Filter
may provide time-defined, "contextual smoothing",
admitting more recently-enunciated hyperlink words
into the vocabulary of the Speech Recognizer with
higher probability than words articulated earlier in the
recognition window. These probabilities are taken into
account by Speech Recognition processor 350 when
performing recognition.
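
The two weighting schemes can be compared with a short sketch. The text does not say how "contextual smoothing" assigns probabilities, so the exponential decay with a half-life used below is purely an assumption, as are the function name window_priors and its parameters.

```python
def window_priors(ages, half_life=None):
    """Assign a prior probability to each hyperlink word in the window.

    ages: dict mapping word -> seconds since the word was enunciated.
    half_life: None gives flat weighting; otherwise weights halve every
    `half_life` seconds (an assumed form of "contextual smoothing").
    """
    if not ages:
        return {}
    if half_life is None:
        weights = {word: 1.0 for word in ages}
    else:
        weights = {word: 0.5 ** (age / half_life) for word, age in ages.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


if __name__ == "__main__":
    ages = {"rates": 0.0, "loans": 10.0}
    print(window_priors(ages))                  # flat-weighted
    print(window_priors(ages, half_life=10.0))  # recent words favored
```

With the decayed weights, the most recently enunciated word receives the larger prior, which a recognizer could combine with its acoustic scores.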

Certain system-action words refer to the activity of
the Phoneme-to-Sound conversion means (e.g.,
"faster", "slower", ...). When such words are recognized
by Speech Recognition processor 350, the signal
identifying each of them is transmitted to the Voiced
Translation means for appropriate action.

It should also be understood that pre-recorded
voice or audio content (e.g., music) can be used, rather
than enunciated text-to-speech, at any point within the
user experience. When human voice is desired rather
than enunciated text, then operation of the system is as
illustrated in Figure 4. As can be seen in the figure, the
data source in this embodiment consists of HTML
Server 201 along with Streaming Audio Server 410
(each such server including an appropriate storage
means). Note that HTML Server 201 and Streaming
Audio Server 410 may be implemented as a single
server or separately, and each may consist of multiple
physical servers, collocated or remote. The data
provided by HTML Server 201 is textual HTML pages as
with the previously described embodiment. For the
Streaming Audio Server, however, the data content
comprises prerecorded speech segments corresponding
to a portion or all of a set of hypertext data to be
made available to the user - such speech segments
typically being established by humans recording the
hypertext data material as a precise reading script. In
an exemplary embodiment the textual portion of the
data in question is read (and recorded) in a male voice,
and the hyperlink words are read in a female voice
(having a distinct pitch from the male voice). Any segment to
which a link can be established is recorded separately.
Playout of the streaming audio segments will be
controlled by the HTML Server.

The system operation for this embodiment
proceeds as described for Figure 3 except that the user is
presented with streaming-audio playback (for at least
selected data segments) instead of enunciated voice.
All hyperlink words played out over Communications
Link 310 penetrate through the Hyperlink Voice
Discriminator 417 into the Hyperlink Words Text and Voice
Synchronization means 420. Hyperlink Voice Discrimination
operates to distinguish voicing for hyperlink words from
that for other text - in the exemplary embodiment, to
discriminate the female voice (hyperlink words) from the
male voice (other text). As before, Hyperlink Text
Identification means 310 feeds hyperlink words (text form)
through, this time to Hyperlink Words Text and Voice
Synchronization means 420, which operates, in a
manner well known in the art, to track the progress of the
streaming audio hyperlink words with the textual version
of the same words, thus providing required
synchronization signals to Window Filter 330. The user interfaces
with the system in exactly the same manner, and the
Speech Recognizer means operates as before. When a
hyperlink word is recognized, the HTML Client is
triggered as before, and the HTML Server causes the
Streaming Audio Server to move to the requested
prerecorded segments and continue playing that new
segment to the user.

III. Application of Methodology of Invention

Embodiments of the present invention can solve
many problems associated with conventional voice
information systems. For example, conventional voice
information systems are often difficult to design and
use. This difficulty stems from the problem of designing
a "user friendly" system for presenting a variety of
options to a listener, often in nested hierarchical form,
from which the listener must select by pressing
touch-tone keys. The difficulty of this design task manifests
itself to any user who, for example, encounters for the
first time an automated transaction system at a banking
or brokerage institution. Users often complain that
nested hierarchies of voice "menus" are difficult to
navigate through. By contrast, the present invention
provides a much more intuitive interface to navigate
through information and select desired options. With the
present invention, a user speaks the options the user
desires, facilitating a more intuitive (e.g., hands-free,
eyes-free) and successful encounter with the system.
Additionally, with the method of the invention, the user is
much more likely to be aware of options available when
selecting a specific option, because of the way the
information is presented and the multiple spoken language
options available at any point. There is no need to
associate concepts with numbers as in many prior-art
methods.

The invention also solves a problem of state-of-the-art
voice recognition systems concerning the recognition
of free-form, unconstrained phrases. By presenting
a browser with spoken hyperlink words to be repeated,
the system "knows" in advance the limited set of words
that are likely to be spoken by a listener in selecting
hypertext. As such, the system can recognize virtually
any spoken word or phrase a voice-information system
designer may devise. The designer is not limited to
selecting a small vocabulary (corresponding to the
voice information system context) for use in recognition,
or among a few alternatives, just to maintain recognizer
accuracy.

Also, embodiments employing the window filter
facilitate enhanced voice recognition performance
through the use of a temporal limit minimizing the
vocabulary of the recognizer. Thus, the recognizer does
not attempt (or need) to recognize all words (selected
by the designer as hypertext) all of the time. This
improves recognizer performance, since correct
recognition is more difficult when a vocabulary is large due to,
for example, the presence of all words a recognizer
needs to recognize over time and their possible
synonyms.

The invention also allows the designers of voice
information systems to take advantage of plentiful
HTML authoring tools, making design of such systems
easy.

Other benefits of embodiments of the present 
invention include the designation of a recorded path 
through information space which can be replicated later 
in expanded HTML media -- for example, a user can
navigate in information space using a telephone, then 
direct the system to deliver text and associated images 
(encountered along the same path) to a fax machine or 
as an attachment to an electronic mail message; the 
opening of parts of the WWW to users of telephones 
and sight-impaired users of PCs; the integrated use of 
voice messaging and e-mail; and the affording of a more 
general applicability of voice information systems
around the world in locations which do not employ
touch-tone telephones. 

IV. Conclusion

A system and method for voiced interaction with a
stored information set has been described herein that
provides for the presentation of an information set of
greater complexity than that handled by the prior art, as
well as a substantially simpler and more intuitive user
interface. In an exemplary application of the invention,
an entity wishing to make a collection of information
available to a set of users, or potential users, would
cause that information to be authored into a set of linked
HTML pages, which HTML data would be loaded into a
storage medium associated with one or more serving
nodes. A means for accessing the serving node, such
as a toll-free telephone number, would then be
established. Typically, information as to the availability of the
information set (as well as the means for access) would
be published and/or advertised to users and/or potential
users. Upon accessing the serving node, a user would
be greeted by an enunciation of text appearing in the
"Home Page" of the HTML database, where hyperlink
words in that Home Page are enunciated in a distinct
manner from that of the regular text. The user would
then "barge in", after hearing a hyperlink word as to
which more information is sought (during an adjustable
time window after the hyperlink word is enunciated), by
repeating that hyperlink word. That "barge in" repeat of
the hyperlink word would be recognized (from multiple
such words "active" within that time window) by a
speech recognizer associated with the serving node,
and that recognition would be translated to a signal
indicating selection of the particular hyperlink word,
causing the server to create a hyperlink to the HTML point
linked to that hyperlink word, or to trigger a transaction
such as the buying or selling of stocks, or the linking of
a user's telephone to that of another for a subsequent
conversation.

Although the present embodiment of the invention
has been described in detail, it should be understood
that various changes, alterations and substitutions can
be made therein without departing from the spirit and
scope of the invention as defined by the appended
claims. In particular, the system may be modified such
that, upon recognition of a hyperlink word voiced by a
user, that word is repeated back to the user as a
confirmation of his choice. In the absence of a user response,
such as "wrong" or "stop" within a short interval, the
system would proceed to implement the hyperlink to the
HTML layer linked to that word. As an additional
modification of the system and method described herein, an
HTML page containing graphic data (which, of course,
cannot be conveyed orally) could be structured so that a
phrase such as "image here" would be voiced to
indicate the presence of such an image. As an additional
feature, the system could be caused to interrogate a
user indicating an interest in such image to provide the
user's fax number, whereupon a faxed copy of the page
containing the image of interest could be sent to the
user's fax machine. As a still further modification,
portions of the data could be stored in an audio form, and
the presentation of that audio data made to the user
establishing a connection to the serving node via a
technology known as streaming audio, a well-known
WWW technique for providing to an HTML client
real-time digitized audio information.
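
The confirmation modification described above (repeat the recognized word, then proceed unless the user objects within a short interval) can be sketched in Python as follows. This is an illustration under assumptions, not the disclosed implementation: the function name confirm_selection, the speak and listen callbacks, and the 2-second default interval are hypothetical, and treating any reply other than "wrong" or "stop" as consent is a choice made here for simplicity.

```python
from typing import Callable, Optional


def confirm_selection(
    word: str,
    speak: Callable[[str], None],
    listen: Callable[[float], Optional[str]],
    interval: float = 2.0,
    veto_words=("wrong", "stop"),
) -> bool:
    """Repeat `word` to the user and report whether the jump should proceed.

    `speak` voices a phrase; `listen` waits up to `interval` seconds and
    returns the recognized reply, or None if the user stayed silent.
    Both callbacks are assumed interfaces supplied by the caller.
    """
    speak(word)                      # echo the recognized hyperlink word
    reply = listen(interval)         # wait for an objection
    if reply is None:
        return True                  # silence: proceed with the hyperlink
    return reply.strip().lower() not in veto_words


# Usage with stand-in callbacks (no real audio is involved).
spoken = []
proceed_on_silence = confirm_selection("weather", spoken.append, lambda t: None)
proceed_on_veto = confirm_selection("weather", spoken.append, lambda t: " Wrong ")
```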

Further, the process by which a user navigates from
the voiced HTML Home Page to the voiced detail of
lower, or related, layers in accordance with the present
invention provides all of the advantages of an interactive
voice response ("IVR") system - as, for example, with
automated call attendant systems, but without the need
to deal with limiting, and often frustrating, menu
structures of IVRs. Instead, that navigational process would
work consistently, regardless of specific content, in
essentially the same way as the text-based navigation
of the World Wide Web arrangement where a user
proceeds from a Home Page down to a layer representing
information of interest. And, as is well known, that
WWW HTML system not only provides a highly versatile
information access medium, but it has also been shown
to have an essentially intuitive navigation scheme which
is substantially user friendly. Thus, a user need only
learn the model for this interface once, and thereafter
will find that an interaction with a separate database
using that model provides a corresponding navigation
scheme regardless of the contents of the database.
Moreover, the underlying data reached by accessing a
hyperlink word will behave in a corresponding manner
to that of the model learned.

In the authoring of the information of interest into
HTML pages, it will be preferable that the hyperlink
words/phrases be relatively "compact" -- i.e., typically
containing one or two words -- and sparse, in order to
both enhance recognition system performance and
make the method of the invention more useful to a user.

Where technical features mentioned in any claim 
are followed by reference signs, those reference signs 
have been included for the sole purpose of increasing
the intelligibility of the claims and accordingly, such ref- 
erence signs do not have any limiting effect on the 
scope of each element identified by way of example by 
such reference signs. 

Claims 

1. A method of facilitating the selection of a hyperlink
from among a plurality of hyperlinks presented to a
user in audio form, said method comprising the
steps of:

furnishing to said user a first signal representing
speech utterances of said plurality of hyperlinks
and one or more other words, wherein
said signal includes an identification of said
hyperlinks;

performing speech recognition on a second
signal representing speech uttered by said
user, said speech recognition being performed
with use of a recognizer vocabulary which
comprises entries corresponding to at least two of
said hyperlinks.

2. The method of Claim 1 wherein said furnishing step
comprises synthesizing said speech utterances
represented in said first signal based on a corpus of
text, said corpus including said plurality of hyperlinks.

3. The method of Claim 2 wherein said corpus of text
comprises text of a document provided by a computer
network server.

4. The method of Claim 3 wherein said document
comprises an HTML page.

5. The method of Claim 1 wherein at least a portion of
said speech utterances represented in said first signal
comprise prerecorded human voice, or

wherein said utterances of hyperlinks furnished
to said user are aurally distinct from said
utterances of said other words, or

wherein said first and second signals are
carried over a telephone network, and further
wherein at least a portion of a routing of said signals
is based on a recognized hyperlink, or

wherein said at least two of said hyperlinks
includes less than all hyperlinks in said plurality of
hyperlinks.

6. The method of one or more of Claims 1-5 further
comprising the step of selecting said recognizer
vocabulary entries to be a subset of all hyperlinks
furnished to said user, or

further comprising the step of selecting said
at least two of said hyperlinks in accordance with a
temporal window defining a subset of said plurality
of hyperlinks.

7. The method of Claim 6 wherein the step of performing
speech recognition further comprises selecting
a particular hyperlink as a recognition result from
among said at least two of said hyperlinks based on
a temporal location of a particular hyperlink within
said window.

8. The method of Claim 1 wherein said hyperlink
entries comprising said recognizer vocabulary for
said speech recognition are limited to a subset of
said plurality of hyperlinks which have occurred in
said first signal during a predefined interval.

9. The method of Claim 8 wherein each of said hyperlink
entries comprising said recognizer vocabulary
has an equal likelihood of representing an unknown
speech utterance in said second signal, or

wherein any one of said hyperlink entries
comprising said recognizer vocabulary has a likelihood
of representing an unknown speech utterance
in said second signal which is weighted according
to a temporal position of said any one hyperlink in
said predefined interval.

10. The method of one or more of Claims 1-9 further
comprising the step of causing a predefined action
to be carried out based on a recognized hyperlink,
or

further comprising the step of initiating a
transaction based on a recognized hyperlink, or

further comprising the step of performing a
transaction based on a recognized hyperlink.

11. The method of Claims 2 or 3 further comprising the
step of identifying a second corpus of text based on
a recognized hyperlink.

12. The method of Claim 11 wherein said second corpus
of text comprises text of a document located on
a computer network server.

13. A system for facilitating the selection of a hyperlink
from among a plurality of hyperlinks presented to a
user in audio form, said system comprising:

an interface providing to said user a first signal
representing speech utterances of said plurality
of hyperlinks and one or more other words,
wherein said signal includes an identification of
said hyperlinks;

a speech recognizer for performing speech
recognition on a second signal representing
speech uttered by said user, said speech
recognition being performed with use of a
recognizer vocabulary which comprises entries
corresponding to at least two of said hyperlinks.

14. The system of Claim 13 wherein said interface
operates to synthesize said speech utterances
represented in said first signal based on a corpus of
text, said corpus including said plurality of hyperlinks.

15. The system of Claim 14 wherein said corpus of text
comprises text of a document provided by a computer
network server.

16. The system of Claim 13 wherein at least a portion
of said speech utterances represented in said first
signal comprise prerecorded human voice, or

wherein identification of said hyperlinks in
said first signal is carried out by providing said
utterances of hyperlinks furnished to said user in an
aurally distinct form from said utterances of said
other words, or

wherein said first and second signals are
carried over a communications network, and further
wherein at least a portion of a routing of said signals
is based on a recognized hyperlink, or

wherein vocabulary entries for said recognition
means are selected to be a subset of all hyperlinks
furnished to said user, or

further including means for selecting said at
least two of said hyperlinks in accordance with a
temporal window defining a subset of said plurality
of hyperlinks.

17. The system of Claim 16 wherein said speech
recognizer selects a particular hyperlink as a recognition
result from among said at least two of said
hyperlinks based on a temporal location of a particular
hyperlink within said window.

18. The system of Claim 13 wherein said hyperlink
entries comprising said recognizer vocabulary for
said speech recognizer are limited to a subset of
said plurality of hyperlinks which have occurred in
said first signal during a predefined interval.

19. The system of Claim 18 wherein each of said
hyperlink entries comprising said recognizer vocabulary
has an equal likelihood of representing an
unknown speech utterance in said second signal, or

wherein any one of said hyperlink entries
comprising said recognizer vocabulary has a likelihood
of representing an unknown speech utterance
in said second signal which is weighted according
to a temporal position of said any one hyperlink in
said predefined interval.

20. The system of one or more of Claims 13-19 including
a means for causing a predefined action to be
carried out based on said recognized hyperlink, or

further comprising a transaction initiator
which initiates a transaction based on a recognized
hyperlink, or

further comprising a transaction processor
for performing a transaction based on a recognized
hyperlink.

21. The system of Claims 14 or 15 including a means 
for identifying a second corpus of text based on a 
recognized hyperlink.

22. The system of Claim 21 wherein said second cor- 
pus of text comprises text of a document located on 
a computer network server. 

23. A voiced information interface system comprising:

a database of information including text having
one or more corresponding information links;

a means operating in conjunction with said
database for causing information to be provided
in voiced form;

a means for recognizing a voiced response by
a user in relation to said provided information;
and

a means for shifting to information related to at
least one of said information links in response
to said recognized user response.

24. The voiced information interface system of Claim
23 wherein said database of information is
arranged as a plurality of information layers and a
linkage between said information layers is provided
by said information links.

25. The voiced information interface system of Claim
24 wherein said information links are provided as
identified information segments in a given information
layer.

26. The voiced information interface system of Claim
25 wherein said information in said given information
layer is provided as a plurality of textual words,
or

wherein said voiced response by a user is
constituted as a repeat of one of said identified
information segments in said given layer.

27. The voiced information interface system of Claim
23 wherein said means for causing information to
be provided in voiced form includes a further means
for causing said information links to be provided in
an aurally distinct manner from other voiced
information.

28. The voiced information interface system of Claim
27 wherein said further means for causing said
information links to be provided in an aurally distinct
manner operates to cause said information links to
be voiced in an opposite gender voice from that of
said other voiced information.

29. The voiced information interface system of Claim
23 wherein said voiced response by a user is
constituted as a direction for a predefined action by
said system, or

including a further means to provide a confirmation
of said voiced response to said user, or

including a further means for providing to a
user graphical information appearing in said database
of information.

30. The voiced information interface system of Claim 
29 wherein said graphical information is provided to 
said user via a graphical access channel means. 

31. A method for providing voiced access to stored
information, wherein said information includes text
having one or more corresponding information
links, comprising the steps of:

causing at least a portion of said information to
be provided in voiced form;
recognizing a voiced response by a user in
relation to said provided information; and
shifting to information related to at least one of
said information links in response to said
recognized user response.

32. The method for providing a voiced access to stored
information of Claim 31 wherein said information is
arranged as a plurality of information layers and a
linkage between said information layers is provided
by said information links.

33. The method for providing a voiced access to stored
information of Claim 32 wherein said information
links are provided as identified information segments
in a given information layer.

34. The method for providing a voiced access to stored
information of Claim 33 wherein said information in
said given layer is provided as a plurality of textual
words, or

wherein voiced response by a user is constituted
as a repeat of one of said identified information
segments in said given layer.

35. The method for providing a voiced access to stored
information of Claim 31 wherein said step of causing
information to be provided in voiced form
includes a substep of causing said information links
to be provided in an aurally distinct manner from
other voiced information.

36. The method for providing a voiced access to stored
information of Claim 35 wherein said substep of
causing said information links to be provided in an
aurally distinct manner operates to cause said
information links to be voiced in an opposite gender
voice from that of said other voiced information.

37. The method for providing a voiced access to stored
information of Claim 31 wherein said voiced
response by a user is constituted as a direction for
a predefined action, or

including a further step of providing a confirmation
of said voiced response to said user, or
including a further step of providing to a user
graphical information appearing in said stored
information.

38. The method for providing a voiced access to stored 
information of Claim 37 wherein said graphical 
information is provided to said user via a graphical
access channel means. 

39. A system for providing an interface to a stored
database of information comprising:

a means for providing said database of information
as a set of linked information layers,
wherein said information is stored in an audio
form;

a means for causing a particular layer of said
information to be provided to a user;
a means for recognizing a voice response by
said user in relation to information in said
particular layer; and

a means for operating on said recognized user
response to effect a shift from said particular
layer to a linked layer.

40. A system for providing an interface to a stored
database of information comprising:

a means for establishing said database of information
as a set of linked information layers,
where linkage between such layers is related to
linkage words in particular information layers;
a means operating in conjunction with said
stored information layers for causing information
in a given layer to be provided in voiced
form, wherein said linkage words in said given
layer are provided in an aurally distinct manner
from other information in said given layer;
a means for recognizing a voiced response by
a user in relation to one of said linkage words in
said given layer; and

a means for operating on said recognized
voiced user response to effect a shift from said
given layer to another layer linked to said
linkage word.